NBA Matches

Objective

Given the growing use of data science in the sports and betting markets, this week you will be part of a startup that wants to beat the NBA betting sites!

The online betting market was valued at around US$85 billion (US$85,047 million) in 2019 and may grow even more in the coming years, given the favorable stance of some governments toward legalizing the platforms and taxing them. [1]

With that in mind, your startup, RodaRodaBet, after an initial study of the American betting market and of the NBA data available online [2], wants to build a model that can indicate whether the home team will win or lose in each round of the league.

In this challenge, you will use data scraped from the NBA & ABA League Index [3], which contains information about the teams playing in each NBA round, to predict whether a given home team will win or lose (Win or Lose).

References:

1 - https://www.globenewswire.com/news-release/2020/08/31/2086041/0/en/Global-Sports-Betting-Market-Worth-85-Billion-in-2019-Industry-Assessment-and-Forecasts-Throughout-2020-2025.html

2 - https://towardsdatascience.com/predicting-the-2020-nba-playoffs-bracket-with-machine-learning-2030f80fa62c

3 - https://www.basketball-reference.com/leagues/

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(2021)
[2]:
df_test= pd.read_csv("test_without_label.csv")
df_train = pd.read_csv("train_full.csv")

Understanding the data

Because the variables are game statistics and there are already plenty of them, we chose not to engineer new variables from them. Instead, we invested in the date variable: we noticed that day of year was quite important for the models we tested, and from it we created other variations, such as day of week and day of month.

Percentage of games won and games lost

[3]:
# strip whitespace from the column names
df_train.columns = df_train.columns.str.strip()
df_test.columns = df_test.columns.str.strip()
[4]:
df_train.WinOrLose.value_counts()
[4]:
L    654
W    352
Name: WinOrLose, dtype: int64
[5]:
y = df_train.WinOrLose.value_counts()/df_train.WinOrLose.value_counts().sum() # relative frequency
plt.bar(['L','W'],y)
plt.title('Relative frequency of wins and losses')
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_10_0.png

The large majority of games are losses, so a stratified split is needed when training the models.
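As a quick sanity check, the class ratio can be preserved when splitting by passing `stratify` to `train_test_split` (a minimal sketch on the raw training frame; the split actually used for modeling appears later, and the variable names here are illustrative only):

from sklearn.model_selection import train_test_split

# Sketch: a stratified split keeps the W/L proportions identical in both partitions
X_chk = df_train.drop(columns=['WinOrLose'])
y_chk = df_train['WinOrLose']
X_tr, X_val, y_tr, y_val = train_test_split(X_chk, y_chk, test_size=0.25,
                                            random_state=42, stratify=y_chk)

print(y_tr.value_counts(normalize=True))   # ~0.65 L / ~0.35 W
print(y_val.value_counts(normalize=True))  # same proportions as the full set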

[6]:
df_train.head()
[6]:
Game Data H_Team H_Wins H_Loss H_W/D % H_SRS H_Games H_TotalPoints H_AvgPointsPerGame ... A_TS% A_eFG% A_TOV% A_ORB% A_FT/FGA A_OeFG% A_OTOV% A_DRB% A_OFT/FGA WinOrLose
0 0 Thu, June 8 Miami Heat 52 30 0.634 3.59 82 8191 99.9 ... 0.550 0.495 13.1 31.8 0.285 0.475 13.7 72.2 0.257 L
1 1 Sun, June 11 Miami Heat 52 30 0.634 3.59 82 8191 99.9 ... 0.550 0.495 13.1 31.8 0.285 0.475 13.7 72.2 0.257 L
2 2 Tue, June 13 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L
3 3 Thu, June 15 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L
4 4 Sun, June 18 Dallas Mavericks 60 22 0.732 5.96 82 8130 99.1 ... 0.556 0.517 13.9 26.7 0.254 0.477 12.4 76.4 0.251 L

5 rows × 135 columns

Preprocessing

Handling the dates

[7]:
treino = df_train
teste  = df_test
[8]:
from datetime import datetime

# Test set: parse strings like 'Thu, June 8' and re-encode them as month-day
for i in range(0, teste.shape[0]):
  teste['Data'].iloc[i] = datetime.strptime(teste['Data'].iloc[i], '%a, %B %d')
  teste['Data'].iloc[i] = datetime.strftime(teste['Data'].iloc[i], '%m-%d')

teste['Data'] = pd.to_datetime(teste['Data'], format="%m-%d", errors='raise')

# Training set: same conversion
for i in range(0, treino.shape[0]):
  treino['Data'].iloc[i] = datetime.strptime(treino['Data'].iloc[i], '%a, %B %d')
  treino['Data'].iloc[i] = datetime.strftime(treino['Data'].iloc[i], '%m-%d')

treino['Data'] = pd.to_datetime(treino['Data'], format="%m-%d", errors='raise')
C:\Users\msini\Anaconda3\lib\site-packages\pandas\core\indexing.py:1637: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
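The same conversion can be done without the explicit loops (and without triggering the `SettingWithCopyWarning`) by handing the original strings straight to `pd.to_datetime`; a hedged sketch, assuming the raw `Data` values look like `'Thu, June 8'`:

# Vectorized alternative to the loops above (run instead of them, not after):
# the source strings carry no year, so a default year is filled in, as before
teste['Data'] = pd.to_datetime(teste['Data'], format='%a, %B %d')
treino['Data'] = pd.to_datetime(treino['Data'], format='%a, %B %d')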

Creating some date features

[9]:
teste['Dia'] = teste.Data.dt.day
treino['Dia'] = treino.Data.dt.day

teste['weekday'] = teste.Data.dt.weekday
treino['weekday'] = treino.Data.dt.weekday

teste['weekofyear'] = teste.Data.dt.weekofyear
treino['weekofyear'] = treino.Data.dt.weekofyear

teste['Dia do Ano'] = teste.Data.dt.dayofyear
treino['Dia do Ano'] = treino.Data.dt.dayofyear
<ipython-input-9-c7af2d33b494>:10: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  teste['weekofyear'] = teste.Data.dt.weekofyear
<ipython-input-9-c7af2d33b494>:11: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  treino['weekofyear'] = treino.Data.dt.weekofyear
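The `FutureWarning` above points to the replacement API; the same column can be built with `isocalendar()` (a sketch — the ISO week can differ slightly from `weekofyear` near year boundaries):

# Non-deprecated equivalent of Series.dt.weekofyear
teste['weekofyear'] = teste.Data.dt.isocalendar().week.astype(int)
treino['weekofyear'] = treino.Data.dt.isocalendar().week.astype(int)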

Exploratory analysis (date variables)

[10]:
vars_dias = ['Dia', 'weekday', 'weekofyear']

for i in vars_dias:

  sns.set(style="ticks")

  x = treino[i]
  coluna = i
  mu = round(x.mean(),2) # mean of distribution
  sigma = round(x.std(),2)  # standard deviation of distribution

  f, (ax_box, ax_hist) = plt.subplots(2)

  sns.boxplot(x=x, ax=ax_box)
  sns.histplot(x=x, ax=ax_hist)

  ax_box.set(yticks=[])
  sns.despine(ax=ax_hist)
  sns.despine(ax=ax_box, left=True)
  ax_box.set_title('Boxplot and histogram of {}\n $\\mu={}$, $\\sigma={}$'.format(coluna, mu, sigma))

plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_20_0.png
../../_images/Jupyter_NBA_Matches_Final_Games_20_1.png
../../_images/Jupyter_NBA_Matches_Final_Games_20_2.png

Weekofyear and day of year have similar distribution shapes.

Bar charts (date variables)

[11]:
visu = sns.catplot(x = 'weekday', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu.set(xticklabels=[])
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_23_0.png
[12]:
visu = sns.catplot(x = 'weekofyear', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu.set(xticklabels=[])
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_24_0.png
[13]:
visu = sns.catplot(x = 'Dia do Ano', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu.set(xticklabels=[])
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_25_0.png
[14]:
visu = sns.catplot(x = 'Dia', data = treino, hue ='WinOrLose', kind = 'count', margin_titles = True)
visu.set(xticklabels=[])
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_26_0.png

Creating the season feature (season of the year: spring, summer, fall, winter)

Keep in mind that in the US the seasons of the year are different (Northern Hemisphere).

[15]:
teste['Season'] = teste.Data.dt.month%12 // 3 + 1
treino['Season'] = treino.Data.dt.month%12 // 3 + 1
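The expression `month % 12 // 3 + 1` buckets the months into the four Northern-Hemisphere meteorological seasons; a quick check of the mapping:

# month % 12 // 3 + 1 maps: Dec-Feb -> 1 (winter), Mar-May -> 2 (spring),
# Jun-Aug -> 3 (summer), Sep-Nov -> 4 (fall)
for month in range(1, 13):
    print(month, month % 12 // 3 + 1)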
[16]:
print(teste['Season'].value_counts())
print(treino['Season'].value_counts())

teste_total = teste.copy()
2    77
3    47
4    41
Name: Season, dtype: int64
2    924
3     82
Name: Season, dtype: int64
[17]:
y = teste['Season'].value_counts()/teste['Season'].value_counts().sum() # relative frequency
plt.bar(['2','3','4'],y)
plt.title('Relative frequency of Season in the test dataframe')
plt.show()

y = treino['Season'].value_counts()/treino['Season'].value_counts().sum() # relative frequency
plt.bar(['2','3'],y)
plt.title('Relative frequency of Season in the training dataframe')
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_31_0.png
../../_images/Jupyter_NBA_Matches_Final_Games_31_1.png

Note that the games occur only in seasons 2, 3 and 4, and that in the training set almost all of them fall in season 2, suggesting this variable may not be very useful for the models.

Dropping the Game column

[18]:
#test set
Id = teste.Game # kept to build the submission later
teste = teste.iloc[:,1:]

#training set
treino = treino.iloc[:,1:]

Dropping the Data column

[19]:
teste.drop('Data', axis=1, inplace= True)

treino.drop('Data', axis=1, inplace= True)

Converting object ('O') columns to int

[20]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
#training set
for i in range(0, len(treino.columns.values)):
  if treino.dtypes[i] == 'O':
    treino.iloc[:, i] = le.fit_transform(treino.iloc[:, i]).astype('int')

#test set
for i in range(0, len(teste.columns.values)):
  if teste.dtypes[i] == 'O':
    teste.iloc[:, i] = le.fit_transform(teste.iloc[:, i]).astype('int')
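One caveat of fitting a fresh `LabelEncoder` on each frame is that the same category (a team name, for example) can receive different integers in train and test if the value sets differ. A more robust sketch fits one encoder per shared column on the combined values; this is an alternative, not what the pipeline above does:

# Sketch: one encoder per object column, fitted on the union of train and test values,
# so identical categories map to identical integers in both frames
from sklearn.preprocessing import LabelEncoder

shared_obj_cols = [c for c in treino.columns
                   if treino[c].dtype == 'O' and c in teste.columns]
for col in shared_obj_cols:
    enc = LabelEncoder()
    enc.fit(pd.concat([treino[col], teste[col]]).astype(str))
    treino[col] = enc.transform(treino[col].astype(str))
    teste[col] = enc.transform(teste[col].astype(str))
# the target column ('WinOrLose') only exists in the training frame and can be
# encoded separately, e.g. with a fixed mapping {'L': 0, 'W': 1}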

Feature Selection

Feature Importance (Random Forest)

[21]:
#Split the features into independent and dependent variables
X = treino.drop('WinOrLose' , axis =1)
X_completo = X
y = treino['WinOrLose']
y_completo = y.copy()
[22]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

colunas = X.columns
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(X)

#No need to scale the test set here, since we are only looking at feature importances
[23]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)
model  = RandomForestClassifier()
model.fit(X_train, y_train)
[23]:
RandomForestClassifier()
[24]:
def plot_feature_importance(importance,names,model_type):

    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

    #Define size of bar plot
    plt.figure(figsize=(10,25))
    #Plot Searborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
[25]:
plot_feature_importance(model.feature_importances_,colunas,'Random Forest ')
../../_images/Jupyter_NBA_Matches_Final_Games_45_0.png

Selecting the top k columns by importance

[26]:
#Create arrays from feature importance and feature names
importance = model.feature_importances_
names = colunas

feature_importance = np.array(importance)
feature_names = np.array(names)

#Create a DataFrame using a Dictionary
data = {'feature_names': feature_names, 'feature_importance': feature_importance}
fi_df = pd.DataFrame(data)

#Sort the DataFrame in order of decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
#Reset the index so the desired columns can be selected by position
fi_df.reset_index(inplace=True)
#Select the desired number of columns, in order of importance
select_colunas = fi_df.feature_names[0:14]
[27]:
list(select_colunas)
[27]:
['Dia',
 'Dia do Ano',
 'weekday',
 'weekofyear',
 'A_MOV',
 'A_SRS',
 'H_Wins',
 'A_W/D %',
 'A_FG%',
 'H_Loss',
 'H_eFG%',
 'H_TS%',
 'H_W/D %',
 'A_Loss']
[28]:
treino_completo = treino.copy()
teste_completo = teste.copy()

treino = treino[select_colunas]

teste = teste[select_colunas]
[29]:
treino.head()
[29]:
Dia Dia do Ano weekday weekofyear A_MOV A_SRS H_Wins A_W/D % A_FG% H_Loss H_eFG% H_TS% H_W/D % A_Loss
0 8 159 4 23 6.07 5.96 52 0.732 0.462 30 0.517 0.556 0.634 22
1 11 162 0 24 6.07 5.96 52 0.732 0.462 30 0.517 0.556 0.634 22
2 13 164 2 24 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30
3 15 166 4 24 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30
4 18 169 0 25 3.87 3.59 60 0.634 0.478 22 0.495 0.550 0.732 30

EDA

Correlation Heatmap

[30]:
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(treino.corr(), dtype=bool))
heatmap = sns.heatmap(treino.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
../../_images/Jupyter_NBA_Matches_Final_Games_53_1.png
[31]:
print(treino.shape)
print(y.shape)
(1006, 14)
(1006,)

Using the first 14 selected variables (with only the day-of-year variable, without the other date derivations), we ran forward selection to find the subset that gave the best result under the ROC-curve metric. After identifying the best subset, we then removed variables that were correlated with some other variable. Doing this with only the 'Dia do Ano' variable when running the random forest (dropping the other date variations), the best features found with the Naive Bayes model were: 'Dia do Ano', 'A_W/D %', 'A_FG%', 'H_MOV', 'H_eFG%', 'A_3P%', 'A_FT%', resulting in a Kaggle score of 0.727.

In the same way, we repeated the procedure testing the other date-derived variables while dropping 'Dia do Ano', and the best variables for the Naive Bayes model were: 'Dia', 'weekday', 'weekofyear', 'H_eFG%', 'A_W/D %', 'A_SRS'. These variables produced our best Kaggle score: 0.729.

Below we reproduce the results for the best test accuracy we obtained, which also led to our best Kaggle ranking.
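For reference, the Naive Bayes forward search described above can be written with the same `SequentialFeatureSelector` used for the random forest further below, just swapping in `GaussianNB` and scoring by ROC AUC. This is only a sketch of the setup (variable names here are illustrative), not a rerun of the exact experiments behind the 0.727 and 0.729 scores:

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector

# Sketch: forward selection with Gaussian Naive Bayes, scored by ROC AUC (5-fold CV),
# over the 14 features kept by the random-forest importances ('treino' holds them here)
X_nb = StandardScaler().fit_transform(treino)
Xnb_tr, Xnb_val, ynb_tr, ynb_val = train_test_split(X_nb, y, random_state=42,
                                                    stratify=y, test_size=0.25)
nb_search = SequentialFeatureSelector(GaussianNB(), k_features=14, forward=True,
                                      scoring='roc_auc', cv=5).fit(Xnb_tr, ynb_tr)
# nb_search.subsets_ maps each subset size to its feature indices and mean CV ROC AUC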

Standardization and train/test split

[32]:
#The 14 best variables chosen by the random forest model without the 'Dia do Ano' variable:
#col = ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'H_MOV', 'A_W/D %', 'H_SRS', 'A_MOV', 'A_Wins', 'A_SRS', 'H_TS%', 'H_W/D %', 'H_Loss']

treino = X_completo[select_colunas]
teste = teste_completo
[33]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(treino)

#Standardize the test set as well
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
teste = scaler_train.fit_transform(teste[select_colunas])
[34]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)

[36]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score

from mlxtend.feature_selection import SequentialFeatureSelector

feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
           k_features = 14,
           forward = True,
           verbose = 2,
           scoring = 'roc_auc',
           cv = 5)
[37]:
# the subset with 8 variables was chosen: score: 0.615483
features = feature_selector.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:   18.2s finished

[2021-10-09 17:31:45] Features: 1/14 -- score: 0.5970483694203371[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  13 out of  13 | elapsed:   13.1s finished

[2021-10-09 17:31:58] Features: 2/14 -- score: 0.5919707650839727[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:   12.4s finished

[2021-10-09 17:32:10] Features: 3/14 -- score: 0.5784259204407453[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:   12.4s finished

[2021-10-09 17:32:23] Features: 4/14 -- score: 0.5912465566778236[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   10.2s finished

[2021-10-09 17:32:33] Features: 5/14 -- score: 0.6063153490714137[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    9.1s finished

[2021-10-09 17:32:42] Features: 6/14 -- score: 0.6132538283818607[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    8.0s finished

[2021-10-09 17:32:51] Features: 7/14 -- score: 0.61611956103196[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    7.3s finished

[2021-10-09 17:32:58] Features: 8/14 -- score: 0.6224219513639999[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    6.1s finished

[2021-10-09 17:33:04] Features: 9/14 -- score: 0.6082610112259709[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.2s finished

[2021-10-09 17:33:09] Features: 10/14 -- score: 0.6064708539438998[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    4.4s finished

[2021-10-09 17:33:14] Features: 11/14 -- score: 0.5931933295814698[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    3.3s finished

[2021-10-09 17:33:17] Features: 12/14 -- score: 0.5899106957732295[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.2s finished

[2021-10-09 17:33:19] Features: 13/14 -- score: 0.5786436272622256[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s finished

[2021-10-09 17:33:21] Features: 14/14 -- score: 0.5741880424158052
[38]:
features.subsets_
[38]:
{1: {'feature_idx': (7,),
  'cv_scores': array([0.55236812, 0.67404698, 0.64054678, 0.5790335 , 0.53924647]),
  'avg_score': 0.5970483694203371,
  'feature_names': ('7',)},
 2: {'feature_idx': (7, 13),
  'cv_scores': array([0.54331921, 0.65941471, 0.64372353, 0.57787832, 0.53551805]),
  'avg_score': 0.5919707650839727,
  'feature_names': ('7', '13')},
 3: {'feature_idx': (1, 7, 13),
  'cv_scores': array([0.57691567, 0.5883712 , 0.61676935, 0.60271467, 0.50735871]),
  'avg_score': 0.5784259204407453,
  'feature_names': ('1', '7', '13')},
 4: {'feature_idx': (1, 7, 10, 13),
  'cv_scores': array([0.57653061, 0.55650751, 0.6380439 , 0.62726223, 0.55788854]),
  'avg_score': 0.5912465566778236,
  'feature_names': ('1', '7', '10', '13')},
 5: {'feature_idx': (1, 2, 7, 10, 13),
  'cv_scores': array([0.59655372, 0.53744705, 0.68569503, 0.61994609, 0.59193485]),
  'avg_score': 0.6063153490714137,
  'feature_names': ('1', '2', '7', '10', '13')},
 6: {'feature_idx': (1, 2, 3, 7, 10, 13),
  'cv_scores': array([0.62216018, 0.54688102, 0.68800539, 0.63573354, 0.57348901]),
  'avg_score': 0.6132538283818607,
  'feature_names': ('1', '2', '3', '7', '10', '13')},
 7: {'feature_idx': (1, 2, 3, 5, 7, 10, 13),
  'cv_scores': array([0.60117443, 0.59568733, 0.66316904, 0.6325568 , 0.5880102 ]),
  'avg_score': 0.61611956103196,
  'feature_names': ('1', '2', '3', '5', '7', '10', '13')},
 8: {'feature_idx': (0, 1, 2, 3, 5, 7, 10, 13),
  'cv_scores': array([0.60714286, 0.59145168, 0.66981132, 0.64882557, 0.59487834]),
  'avg_score': 0.6224219513639999,
  'feature_names': ('0', '1', '2', '3', '5', '7', '10', '13')},
 9: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 10, 13),
  'cv_scores': array([0.5732576 , 0.61532538, 0.6413169 , 0.62427801, 0.58712716]),
  'avg_score': 0.6082610112259709,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '10', '13')},
 10: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 10, 11, 13),
  'cv_scores': array([0.57142857, 0.60329226, 0.63274933, 0.64276088, 0.58212323]),
  'avg_score': 0.6064708539438998,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '10', '11', '13')},
 11: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 13),
  'cv_scores': array([0.53475164, 0.60271467, 0.64767039, 0.61725067, 0.56357928]),
  'avg_score': 0.5931933295814698,
  'feature_names': ('0', '1', '2', '3', '4', '5', '7', '8', '10', '11', '13')},
 12: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 13),
  'cv_scores': array([0.55217559, 0.59520601, 0.63659992, 0.60964575, 0.55592622]),
  'avg_score': 0.5899106957732295,
  'feature_names': ('0',
   '1',
   '2',
   '3',
   '4',
   '5',
   '7',
   '8',
   '9',
   '10',
   '11',
   '13')},
 13: {'feature_idx': (0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13),
  'cv_scores': array([0.53696573, 0.59279938, 0.62283404, 0.60107817, 0.53954082]),
  'avg_score': 0.5786436272622256,
  'feature_names': ('0',
   '1',
   '2',
   '3',
   '4',
   '5',
   '7',
   '8',
   '9',
   '10',
   '11',
   '12',
   '13')},
 14: {'feature_idx': (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  'cv_scores': array([0.52050443, 0.59279938, 0.62023489, 0.60512129, 0.53228022]),
  'avg_score': 0.5741880424158052,
  'feature_names': ('0',
   '1',
   '2',
   '3',
   '4',
   '5',
   '6',
   '7',
   '8',
   '9',
   '10',
   '11',
   '12',
   '13')}}
[39]:
#The best subset was: ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'A_W/D %', 'A_Wins', 'A_SRS']
cols = ['Dia', 'weekday', 'weekofyear', 'A_Loss', 'H_eFG%', 'A_W/D %', 'A_Wins', 'A_SRS']
treino = X_completo
treino = treino[cols]
[40]:
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(treino.corr(), dtype=bool))
heatmap = sns.heatmap(treino.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
../../_images/Jupyter_NBA_Matches_Final_Games_66_1.png

Based on the correlations, we chose to drop the 'A_Loss' and 'A_Wins' variables.
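A programmatic way to flag such pairs is to scan the correlation matrix for coefficients above a threshold (a small sketch; the 0.9 cutoff is illustrative, not the criterion used here):

# Sketch: list feature pairs whose absolute correlation exceeds a chosen threshold
corr = treino.corr().abs()
threshold = 0.9  # illustrative cutoff
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > threshold:
            print('{} x {}: {:.2f}'.format(a, b, corr.loc[a, b]))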

[41]:
col = ['Dia', 'weekday', 'weekofyear', 'H_eFG%','A_W/D %', 'A_SRS']
treino = X_completo[col]
teste = teste_completo
[42]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
X = scaler_train.fit_transform(treino)

#Standardize the test set as well
scaler_train = StandardScaler()
#scaler_train = MinMaxScaler()
teste = scaler_train.fit_transform(teste[col])

#train/validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)

Naive Bayes

[43]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model_NB = GaussianNB()                       # 2. instantiate model
model_NB.fit(X_train, y_train)                # 3. fit model to data
y_predNB = model_NB.predict(X_test)            # 4. predict on new data

# compute the evaluation metrics

print('Balanced accuracy Naive Bayes: {:.3f}'.format(balanced_accuracy_score(y_predNB, y_test)))
print("F1 score Naive Bayes: {:.3f}".format(f1_score(y_test, y_predNB, average = "weighted")))
print("Precision Naive Bayes: {:.3f}".format(precision_score(y_test, y_predNB, average = "weighted")))
Balanced accuracy Naive Bayes: 0.632
F1 score Naive Bayes: 0.655
Precision Naive Bayes: 0.656
[44]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model_NB, X_train, y_train, cv=10)

print(cv_scores)
print("Media Cross-val accuracy: %f" % cv_scores.mean())
print("Variância: %f" % cv_scores.var())
[0.68421053 0.68421053 0.69736842 0.67105263 0.73333333 0.65333333
 0.72       0.66666667 0.62666667 0.65333333]
Media Cross-val accuracy: 0.679018
Variância: 0.000930
[45]:
from sklearn.model_selection import cross_validate

#cv = cross_validate(model_NB, X_train, y_train, return_train_score=True)
cv = cross_validate(model_NB, X, y, return_train_score=True, cv=10)

print(cv['test_score'].mean())
print(cv['train_score'].mean())
0.667950495049505
0.6770484065712926

SVM

[46]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

#Hyper-parameters for the optimization
C = np.arange(1,30)
gamma = ["scale", "auto"]
decision_function_shape = ["ovo", "ovr"]
k_fold = 10
#Grid search to find the best combination of hyper-parameter values,
#   also applying 10-fold cross-validation.
model_svm = GridSearchCV(SVC(), cv = k_fold,
                     param_grid={"C": C, "gamma": gamma, "decision_function_shape": decision_function_shape})
model_svm.fit(X_train, y_train)
y_pred = model_svm.predict(X_test)



#Measure the quality of the fitted model
print("Balanced accuracy SVM: {:.3f}".format(balanced_accuracy_score(y_test, y_pred)))
print("F1 score SVM: {:.3f}".format(f1_score(y_test, y_pred, average = "weighted")))
print("Precision SVM: {:.3f}".format(precision_score(y_test, y_pred, average = "weighted")))
Balanced accuracy SVM: 0.574
F1 score SVM: 0.629
Precision SVM: 0.629
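The hyper-parameter combination selected by the grid search can be inspected directly on the fitted object (a small check):

# Combination chosen by GridSearchCV and its mean 10-fold CV accuracy
print(model_svm.best_params_)
print(model_svm.best_score_)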
[47]:
#SVM
from sklearn.model_selection import cross_validate
cv = cross_validate(model_svm.best_estimator_, X, y, return_train_score=True, cv=10)
print(cv['test_score'].mean())
print(cv['train_score'].mean())
0.6252871287128713
0.7168105813910943

Submitting the Naive Bayes predictions

[48]:
y_pred = model_NB.predict(teste)
y_pred = np.array(y_pred, dtype = int)

prediction = pd.DataFrame()
prediction['Game'] = Id
prediction['WinOrLose'] = y_pred
prediction['WinOrLose']

d = {1: 'W', 0: 'L'} # LabelEncoder mapped the target alphabetically: L -> 0, W -> 1
prediction['WinOrLose'].replace(d,inplace = True)
[49]:
prediction.head()
[49]:
Game WinOrLose
0 0 W
1 1 W
2 2 L
3 3 L
4 4 W
[50]:
prediction['WinOrLose'].value_counts()
[50]:
L    130
W     35
Name: WinOrLose, dtype: int64
[51]:
y = prediction['WinOrLose'].value_counts()/prediction.WinOrLose.value_counts().sum() # relative frequency
plt.bar(['L','W'],y)
plt.title('Relative frequency of wins and losses')
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_81_0.png
[52]:
prediction.to_csv('NB.csv', index = False)

Kaggle score: 0.729

Submitting the SVM predictions

[53]:
y_pred = model_svm.predict(teste)
y_pred = np.array(y_pred, dtype = int)

prediction = pd.DataFrame()
prediction['Game'] = Id
prediction['WinOrLose'] = y_pred
prediction['WinOrLose']

d = {1: 'W', 0: 'L'}
prediction['WinOrLose'].replace(d,inplace = True)
[54]:
prediction.head()
[54]:
Game WinOrLose
0 0 L
1 1 L
2 2 L
3 3 L
4 4 L
[55]:
prediction['WinOrLose'].value_counts()
[55]:
L    136
W     29
Name: WinOrLose, dtype: int64
[56]:
y = prediction['WinOrLose'].value_counts()/prediction.WinOrLose.value_counts().sum() # relative frequency
plt.bar(['L','W'],y)
plt.title('Relative frequency of wins and losses')
plt.show()
../../_images/Jupyter_NBA_Matches_Final_Games_88_0.png
[57]:
prediction.to_csv('SVM.csv', index = False)